NSF PAR Search | NSF Public Access Repository

Alleviating the Fear of Losing Alignment in LLM Fine-tuning

https://doi.org/10.1109/SP61157.2025.00171

Yang, Kang; Tao, Guanhong; Chen, Xun; Xu, Jun (May 2025, IEEE)

Large language models (LLMs) have demonstrated revolutionary capabilities in understanding complex contexts and performing a wide range of tasks. However, LLMs can also answer questions that are unethical or harmful, raising concerns about their applications. A training strategy called alignment regulates LLMs' responses to such questions. Yet alignment can be unexpectedly compromised when an LLM is fine-tuned for downstream tasks. This paper focuses on recovering the alignment lost during fine-tuning. We observe that there are two distinct directions inherent in an aligned LLM: the aligned direction and the harmful direction. An LLM is inclined to answer questions in the aligned direction while refusing queries in the harmful direction. Therefore, we propose to recover the harmful direction of the fine-tuned model that has been compromised. Specifically, we restore a small subset of the fine-tuned model's weight parameters from the original aligned model using gradient descent. We also introduce a rollback mechanism to avoid aggressive recovery and maintain downstream task performance. Our evaluation on 125 fine-tuned LLMs demonstrates that our method can reduce their harmful rate (the percentage of harmful questions answered) from 33.25% to 1.74% without substantially sacrificing task performance. In contrast, existing methods either reduce the harmful rate only to a limited extent or significantly impair normal functionality. Our code is available at https://github.com/kangyangWHU/LLMAlignment
Free, publicly-accessible full text available May 12, 2026
On Large Language Models’ Resilience to Coercive Interrogation

Zhang, Zhuo; Shen, Guangyu; Tao, Guanhong; Cheng, Siyuan; Zhang, Xiangyu (May 2024, ACM/IEEE)

Full Text Available
RULER: Discriminative and Iterative Adversarial Training for Deep Neural Network Fairness

https://doi.org/10.1145/3540250.3549169

Tao, Guanhong; Sun, Weisong; Han, Tingxu; Fang, Chunrong; Zhang, Xiangyu (November 2022, Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering)

Full Text Available
RULER: Discriminative and Iterative Adversarial Training for Deep Neural Network Fairness

Tao, Guanhong; Sun, Weisong; Han, Tingxu; Fang, Chunrong; Zhang, Xiangyu (January 2022, Proceedings of the 2022 ACM SIGSOFT International Symposium on the Foundations of Software Engineering)

Full Text Available
MIRROR: Model Inversion for Deep Learning Network with High Fidelity

An, Shengwei; Tao, Guanhong; Xu, Qiuling; Liu, Yingqi; Shen, Guangyu; Yao, Yuan; Xu, Jingwei; Zhang, Xiangyu (January 2022, Proceedings of the 29th Network and Distributed System Security Symposium)

Full Text Available
Better Trigger Inversion Optimization in Backdoor Scanning

https://doi.org/10.1109/CVPR52688.2022.01301

Tao, Guanhong; Shen, Guangyu; Liu, Yingqi; An, Shengwei; Xu, Qiuling; Ma, Shiqing; Li, Pan; Zhang, Xiangyu (January 2022, IEEE/CVF Conference on Computer Vision and Pattern Recognition)

Full Text Available
OSPREY: Recovery of Variable and Data Structure via Probabilistic Analysis for Stripped Binary

https://doi.org/10.1109/SP40001.2021.00051

Zhang, Zhuo; Ye, Yapeng; You, Wei; Tao, Guanhong; Lee, Wen-chuan; Kwon, Yonghwi; Aafer, Yousra; Zhang, Xiangyu (May 2021, 2021 IEEE Symposium on Security and Privacy (SP))
Full Text Available
Backdoor Scanning for Deep Neural Networks through K-Arm Optimization

Shen, Guangyu; Liu, Yingqi; Tao, Guanhong; An, Shengwei; Xu, Qiuling; Cheng, Siyuan; Ma, Shiqing; Zhang, Xiangyu (January 2021, Proceedings of the 38th International Conference on Machine Learning)

Full Text Available
Correlations between Deep Neural Network Model Coverage Criteria and Model Quality

https://doi.org/10.1145/3368089.3409671

Yan, Shenao; Tao, Guanhong; Liu, Xuwei; Zhai, Juan; Ma, Shiqing; Xu, Lei; Zhang, Xiangyu (October 2020, Proceedings of the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’20))
Full Text Available
TRADER: Trace Divergence Analysis and Embedding Regulation for Debugging Recurrent Neural Networks

https://doi.org/10.1145/3377811.3380423

Tao, Guanhong; Ma, Shiqing; Liu, Yingqi; Xu, Qiuling; Zhang, Xiangyu (June 2020, Proceedings of ICSE)
Full Text Available
